CuMAPz: Analyzing the Efficiency of Memory Access Pattern in CUDA
Authors
Abstract
Even though the entry barrier to writing a GPGPU program has been lowered with the help of many high-level programming models, such as NVIDIA CUDA, it is still very difficult to optimize a program so as to fully utilize the given architecture’s performance. The burden on GPGPU programmers keeps growing as they have to consider many parameters, especially those related to the memory access pattern, and even a small change in those parameters can lead to a drastic performance change that is not obvious, or often counterintuitive, before careful analysis. In this paper, we focus on optimizing a CUDA program using shared memory. We present a tool that analyzes the efficiency of given memory-access-pattern parameters. Given a set of parameters, the tool analyzes data reuse, global memory access coalescing, shared memory bank conflicts, partition camping, and branch divergence. The output of the tool is profitability, a comprehensive performance metric introduced in this paper. Profitability can be used to compare the efficiency of different sets of parameters without even writing a program. Experimental results show that profitability accurately predicts how the performance of a program changes as the memory-access-pattern-related parameters are varied.
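As a toy illustration of two of the checks such an analysis performs (not CuMAPz's actual implementation), the sketch below scores a warp's memory addresses on the two axes the abstract names: the number of 128-byte global-memory segments touched (1 means fully coalesced) and the worst-case number of threads mapping to the same shared-memory bank (1 means conflict-free). The warp size, segment size, and 32-bank/4-byte-word layout are assumptions matching common CUDA GPUs:

```python
from collections import Counter

WARP_SIZE = 32
SEGMENT_BYTES = 128   # assumed global-memory coalescing granularity
NUM_BANKS = 32        # assumed shared-memory bank count
WORD_BYTES = 4        # assumed bank word width

def global_segments(addresses):
    """Count the 128-byte segments a warp's accesses touch; 1 = fully coalesced."""
    return len({addr // SEGMENT_BYTES for addr in addresses})

def max_bank_conflict(addresses):
    """Worst-case number of threads hitting one shared-memory bank; 1 = conflict-free."""
    per_bank = Counter((addr // WORD_BYTES) % NUM_BANKS for addr in addresses)
    return max(per_bank.values())

# Unit-stride pattern: one segment, each thread in its own bank.
unit = [tid * WORD_BYTES for tid in range(WARP_SIZE)]
print(global_segments(unit), max_bank_conflict(unit))        # 1 1

# Stride-32 pattern: 32 segments, all threads serialized on one bank.
strided = [tid * 32 * WORD_BYTES for tid in range(WARP_SIZE)]
print(global_segments(strided), max_bank_conflict(strided))  # 32 32
```

The two example patterns show why a small parameter change (here, an access stride) can swing both metrics from best case to worst case at once, which is the kind of non-obvious interaction the tool is meant to expose before any code is written.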
Similar Papers
Performance evaluation of GPU memory hierarchy using the FFT
Modern GPUs (Graphics Processing Units) are becoming more relevant in the world of HPC (High Performance Computing) thanks to their large computing power and relatively low cost; however, their special architecture results in more complex programming. To take advantage of their computing resources and develop efficient implementations, it is essential to have certain knowledge about the architecture a...
Full text

Generating GPU Code from a High-Level Representation for Image Processing Kernels
We present a framework for representing image processing kernels based on decoupled access/execute metadata, which allows the programmer to specify both the execution constraints and the memory access pattern of a kernel. The framework performs source-to-source translation of kernels expressed in high-level framework-specific C++ classes into low-level CUDA or OpenCL code with effective device-dependent ...
Full text

Parallelization of Rich Models for Steganalysis of Digital Images using a CUDA-based Approach
There are several different methods for building an efficient strategy for steganalysis of digital images. A very powerful method in this area is the rich model, consisting of a large number of diverse sub-models in both the spatial and transform domains, that should be utilized. However, the extraction of various types of features from an image is very time consuming in some steps, especially for the training pha...
Full text

A new approach to the lattice Boltzmann method for graphics processing units
Emerging many-core processors, like CUDA-capable NVIDIA GPUs, are promising platforms for regular parallel algorithms such as the Lattice Boltzmann Method (LBM). Since global memory on graphics devices shows high latency and LBM is data intensive, the memory access pattern is an important issue for achieving good performance. Whenever possible, global memory loads and stores should be coalesced and a...
Full text

AutoGPU: Automatic Generation of CUDA Kernel Code
Manual optimization of a CUDA kernel can be an arduous task, even for the simplest of kernels. The CUDA programming model is such that high performance may only be achieved if memory accesses in the kernel follow certain patterns; further, fine-tuning of the kernel execution and loop configuration may result in a dramatic increase in performance. The number of possible such configurations mak...
Full text